import pyclimb as pc
from pyclimb.vis_func import clus_map, heat_map
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

Loading and Cleaning#

First, use the load_data function to load the preloaded datasets for the demo. These include the climbing dataset in both clean and raw form, a weather dataset from Utah weather stations, and a Utah cities dataset scraped from the web. For this demo, we use the raw form of the dataset to demonstrate the built-in cleaning function.

If you are reading in your own data from mountainproject.com, put all of your files in the same working directory and use the pc.concat() function.

# if using your own files downloaded from Mountain Project, uncomment this code
# climbs = pc.concat(['route-finder(1).csv', 'route-finder(2).csv', 'route-finder(3).csv'])
climbs = pc.load_data('raw')
weather = pc.load_data('weather')
cities = pc.load_data('cities')
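The internals of pc.concat() aren't shown here, but stacking several route-finder exports is essentially a pandas concatenation. A minimal sketch (concat_route_csvs is an illustrative name, not part of pyclimb, and assumes overlapping searches can produce duplicate routes):

```python
import pandas as pd

def concat_route_csvs(paths):
    """Read each Mountain Project route-finder export and stack them
    into one dataframe, dropping duplicate rows from overlapping searches."""
    frames = [pd.read_csv(p) for p in paths]
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates()
```

ignore_index=True rebuilds a clean 0..n index so that row labels from the separate files don't collide.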

You can then clean the data using the pc.clean() function.

pc.clean(climbs, inplace = True)

Scraping#

Additionally, there is a scraper function that collects extra data for each climb. The crawl-delay requested by mountainproject.com is 60 seconds, so for the purposes of this demo it is left commented out; if you are using your own data, feel free to uncomment it.

# pc.scrape_mp(climbs, inplace = True) # uncomment this section to scrape from MP
# I have already scraped the data from MP and it is in the clean dataset below
climbs = pc.load_data('clean')

Merging the data#

Once you have your cleaned data, use merge_data_dist to merge the desired dataframes based on the closest latitude and longitude. This example merges the climbs dataset with two others: weather stations and cities.

climbing = pc.merge_data_dist(climbs, weather, 'STATION_NA', 'Area Latitude', 'Area Longitude', 'LATITUDE', 'LONGITUDE')
# function adds distance variable by default, so we rename it to be more specific
climbing.rename({'Distance' : 'station_dist', 'Location' : 'climb_location'}, inplace = True, axis = 1) 
climbing = pc.merge_data_dist(climbing, cities, 'Location')
climbing.rename({'Distance' : 'city_dist', 'Location' : 'city'}, inplace = True, axis = 1)
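The implementation of merge_data_dist isn't shown here, but a closest-point merge can be sketched as follows. merge_nearest and haversine_km are illustrative names, not part of pyclimb; the sketch assumes each left-hand row is paired with the single right-hand row at minimum great-circle distance, with that distance stored in a Distance column:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def merge_nearest(left, right, left_lat, left_lon, right_lat, right_lon):
    """For each row in `left`, attach the closest row from `right`
    plus the distance to it (a sketch of a nearest-point merge)."""
    rows = []
    for _, row in left.iterrows():
        d = haversine_km(row[left_lat], row[left_lon],
                         right[right_lat], right[right_lon])
        nearest = right.loc[d.idxmin()]
        merged = pd.concat([row, nearest])
        merged['Distance'] = d.min()
        rows.append(merged)
    return pd.DataFrame(rows).reset_index(drop=True)
```

A brute-force scan like this is quadratic in the number of rows; for larger tables a spatial index (e.g. a k-d tree on the right-hand coordinates) would be the usual optimization.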

climbing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6107 entries, 0 to 6106
Data columns (total 34 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Route           6107 non-null   object 
 1   URL             6107 non-null   object 
 2   Avg Stars       6062 non-null   float64
 3   Rating          6107 non-null   object 
 4   Pitches         6107 non-null   int64  
 5   Length          5474 non-null   float64
 6   Latitude        6107 non-null   float64
 7   Longitude       6107 non-null   float64
 8   PG13            6107 non-null   bool   
 9   R               6107 non-null   bool   
 10  State           6107 non-null   object 
 11  Region          6107 non-null   object 
 12  climb_location  6107 non-null   object 
 13  Crag            5874 non-null   object 
 14  Wall            4702 non-null   object 
 15  Trad            6107 non-null   bool   
 16  Alpine          6107 non-null   bool   
 17  TR              6107 non-null   bool   
 18  Aid             6107 non-null   bool   
 19  Boulder         6107 non-null   bool   
 20  Mixed           6107 non-null   bool   
 21  Rating_num      6107 non-null   float64
 22  numVotes        6107 non-null   int64  
 23  numViews        6107 non-null   int64  
 24  Year            6107 non-null   int64  
 25  ViewsPerMonth   6107 non-null   int64  
 26  Shared_by       6107 non-null   object 
 27  Month           6107 non-null   int64  
 28  Day             6107 non-null   int64  
 29  Date            6107 non-null   object 
 30  STATION_NA      6107 non-null   object 
 31  station_dist    6107 non-null   float64
 32  city            6107 non-null   object 
 33  city_dist       6107 non-null   float64
dtypes: bool(8), float64(7), int64(7), object(12)
memory usage: 1.3+ MB

Maps#

Once the datasets are merged, you can make an interactive cluster or heat map. These use the latitude, longitude, and a description for each point. The maps save to a file by default, but can be displayed on screen instead by passing save = False.

clus_map(climbing, desc = 'Route', save = False)

heat_map(climbing, desc = 'Avg Stars', save = False)

Analysis#

Here we use scikit-learn to perform a weighted linear regression and analyze the data.

# Drop na values
climbing.dropna(inplace = True)

# Label encode categorical features
label_encoder = LabelEncoder()
categorical_variables = ['Region', 'climb_location', 'Crag', 'Wall', 'city']
for var in categorical_variables:
    climbing[var] = label_encoder.fit_transform(climbing[var])
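To see what the loop above does to each column, here is LabelEncoder on a toy list of crag names (the names themselves are made up for illustration). Classes are sorted alphabetically and each category is replaced by its index, so the resulting integers impose an arbitrary ordering that a linear model will treat as ordinal:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Big Cottonwood', 'Little Cottonwood',
                          'Big Cottonwood', 'Moab'])
# classes_ are sorted alphabetically; each category maps to its index
print(le.classes_)  # ['Big Cottonwood' 'Little Cottonwood' 'Moab']
print(codes)        # [0 1 0 2]
```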

# fit the model
reg_mod = LinearRegression()
X = climbing.drop(['Avg Stars', 'numVotes', 'URL', 
                   'Route', 'Rating', 'Shared_by', 
                   'STATION_NA', 'State', 'Date'], axis = 1)
y = climbing['Avg Stars']
weights = climbing['numVotes']

reg_mod.fit(X, y, weights)
LinearRegression()
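The fit above uses all of the data; train_test_split (imported earlier but not used) lets you hold out a test set so the weighted fit can be scored on unseen rows. A sketch on synthetic data, since the real X, y, and numVotes weights are stand-ins here (the true coefficients 1.5 and -2.0 are chosen for the toy data, not taken from the climbing analysis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for X, y, and the numVotes weights
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 1.5 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)
w_toy = rng.integers(1, 50, size=200)

# Split features, target, and weights together so rows stay aligned
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X_toy, y_toy, w_toy, random_state=0)

model = LinearRegression().fit(X_tr, y_tr, sample_weight=w_tr)
print(model.coef_)   # close to [1.5, -2.0, 0.0]
print(model.score(X_te, y_te, sample_weight=w_te))  # weighted R^2 on held-out rows
```

Passing the weights positionally, as in reg_mod.fit(X, y, weights) above, is equivalent to sample_weight=weights; rows with more votes simply count more toward the squared-error loss.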